HuggingFaceDataset

`HuggingFaceDataset`

Bases: Dataset

Streaming dataset backed by a HuggingFace datasets source.

Each row produced by datasets.load_dataset is rendered to JSON through the Jinja2 input_template / output_template, validated against the corresponding DataModel (or synalinks.ChatMessages when the data model is None), and accumulated into batches of batch_size rows. Each batch is yielded as an (x, y) pair of NumPy object arrays holding DataModel instances, which is the format synalinks' GeneratorDataAdapter expects.
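
The render, validate, and batch flow described above can be approximated in plain Python. This is only a sketch under stated assumptions: `batched_pairs` and its `render_input` / `render_output` callables are hypothetical stand-ins for the Jinja2 templates, `json.loads` stands in for `DataModel` validation, lists stand in for NumPy object arrays, and yielding a trailing partial batch is an assumption rather than confirmed behavior.

```python
import json
from itertools import islice


def batched_pairs(rows, render_input, render_output, batch_size=1, limit=None):
    """Yield (x, y) batches built from an iterable of row dicts."""
    if limit is not None:
        rows = islice(rows, limit)  # cap how many rows are consumed
    xs, ys = [], []
    for row in rows:
        # Render each side to a JSON string, then parse it back.
        # Parsing stands in for validation against a DataModel.
        xs.append(json.loads(render_input(row)))
        ys.append(json.loads(render_output(row)))
        if len(xs) == batch_size:
            yield xs, ys
            xs, ys = [], []
    if xs:
        # Assumption: a trailing partial batch is still yielded.
        yield xs, ys


def render_answer(row):
    # Mirrors the GSM8K-style template: keep the text after "####".
    return json.dumps({"answer": row["answer"].split("####")[-1].strip()})


rows = [{"question": f"q{i}", "answer": f"work #### {i}"} for i in range(5)]
batches = list(
    batched_pairs(
        rows,
        render_input=lambda r: json.dumps({"question": r["question"]}),
        render_output=render_answer,
        batch_size=2,
    )
)
```

With five rows and `batch_size=2`, the sketch produces two full batches and one short batch.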

Templates should render to JSON. Use Jinja's tojson filter for safe string escaping.
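
To see why escaping matters, the following sketch compares naive string interpolation with escaped interpolation. It uses the standard library's `json.dumps` as a stand-in for the role `tojson` plays inside a template; the sample `question` value is made up for illustration.

```python
import json

question = 'She said "hi" and left.\nThen what?'

# Naive interpolation: the quotes and the newline inside the value
# produce text that is not valid JSON.
naive = '{"question": "' + question + '"}'
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Escaped interpolation: json.dumps emits a quoted, escaped JSON string,
# so no surrounding quotes are written by hand.
escaped = '{"question": ' + json.dumps(question) + "}"
parsed = json.loads(escaped)
```

Because the escaped form already includes its own quotes, a template writes `{{ question | tojson }}` without wrapping it in quotation marks.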

Example:

ds = synalinks.HuggingFaceDataset(
    path="gsm8k",
    name="main",
    split="train",
    input_data_model=MathQuestion,
    input_template='{"question": {{ question | tojson }}}',
    output_data_model=NumericalAnswer,
    output_template='{"answer": {{ answer.split("####")[-1].strip() | tojson }}}',
    batch_size=8,
)
program.fit(x=ds())

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The HuggingFace dataset repo / builder name (first positional argument of `datasets.load_dataset`). | required |
| `name` | `str` | Optional. The dataset configuration name. | `None` |
| `split` | `str` | Optional. The split to load (e.g. `"train"`, `"test"`). When `None`, the entire `DatasetDict` is iterated in split order. | `None` |
| `revision` | `str` | Optional. The dataset revision (commit hash, branch, tag). | `None` |
| `streaming` | `bool` | If `True` (default), use HF's `IterableDataset` so rows are downloaded on demand, which is required for datasets that don't fit on disk. The generator naturally terminates when the source is exhausted, so the trainer ends the epoch on its own; pass `steps_per_epoch` only if you also want shorter epochs. | `True` |
| `input_data_model` | `DataModel` | See `Dataset`. | `None` |
| `input_schema` | `dict \| str` | See `Dataset`. | `None` |
| `input_template` | `str` | See `Dataset`. | `None` |
| `output_data_model` | `DataModel` | See `Dataset`. | `None` |
| `output_schema` | `dict \| str` | See `Dataset`. | `None` |
| `output_template` | `str` | See `Dataset`. | `None` |
| `batch_size` | `int` | Examples per yielded batch. Defaults to `1`. | `1` |
| `limit` | `int` | Optional. See `Dataset`. Caps how many rows are consumed (across all splits). Also makes `__len__` available for streaming datasets. | `None` |
| `repeat` | `int` | See `Dataset`. | `1` |
| `**kwargs` | `Any` | Forwarded to `datasets.load_dataset` (e.g. `data_files`, `token`, `trust_remote_code`, ...). | `{}` |

Source code in synalinks/src/datasets/huggingface_dataset.py

@synalinks_export(
    [
        "synalinks.HuggingFaceDataset",
        "synalinks.datasets.HuggingFaceDataset",
    ]
)
class HuggingFaceDataset(Dataset):
    """Streaming dataset backed by a HuggingFace ``datasets`` source.

    Each row produced by ``datasets.load_dataset`` is rendered through the
    Jinja2 ``input_template`` / ``output_template`` to JSON, validated
    against the corresponding ``DataModel`` (or ``synalinks.ChatMessages``
    when ``None``), and accumulated into batches of size ``batch_size``.
    Each batch is yielded as ``(x, y)`` — numpy object arrays of
    ``DataModel`` instances — matching the format synalinks'
    ``GeneratorDataAdapter`` expects.

    Templates should render to JSON. Use Jinja's ``tojson`` filter for
    safe string escaping.

    Example:

    ```python
    ds = synalinks.HuggingFaceDataset(
        path="gsm8k",
        name="main",
        split="train",
        input_data_model=MathQuestion,
        input_template='{"question": {{ question | tojson }}}',
        output_data_model=NumericalAnswer,
        output_template='{"answer": {{ answer.split("####")[-1].strip() | tojson }}}',
        batch_size=8,
    )
    program.fit(x=ds())
    ```

    Args:
        path (str): The HuggingFace dataset repo / builder name (first
            positional argument of ``datasets.load_dataset``).
        name (str): Optional. The dataset configuration name.
        split (str): Optional. The split to load (e.g. ``"train"``,
            ``"test"``). When ``None``, the entire ``DatasetDict`` is
            iterated in split order.
        revision (str): Optional. The dataset revision (commit hash,
            branch, tag).
        streaming (bool): If ``True`` (default), use HF's
            ``IterableDataset`` so rows are downloaded on demand —
            required for datasets that don't fit on disk. The generator
            naturally terminates when the source is exhausted, so the
            trainer ends the epoch on its own; pass ``steps_per_epoch``
            only if you also want shorter epochs.
        input_data_model (DataModel): See ``Dataset``.
        input_schema (dict | str): See ``Dataset``.
        input_template (str): See ``Dataset``.
        output_data_model (DataModel): See ``Dataset``.
        output_schema (dict | str): See ``Dataset``.
        output_template (str): See ``Dataset``.
        batch_size (int): Examples per yielded batch. Defaults to ``1``.
        limit (int): Optional. See ``Dataset``. Caps how many rows are
            consumed (across all splits). Also makes ``__len__``
            available for streaming datasets.
        repeat (int): See ``Dataset``.
        **kwargs (Any): Forwarded to ``datasets.load_dataset`` (e.g.
            ``data_files``, ``token``, ``trust_remote_code``, ...).
    """

    def __init__(
        self,
        path,
        *,
        name=None,
        split=None,
        revision=None,
        streaming=True,
        input_data_model=None,
        input_schema=None,
        input_template=None,
        output_data_model=None,
        output_schema=None,
        output_template=None,
        batch_size=1,
        limit=None,
        repeat=1,
        **kwargs,
    ):
        super().__init__(
            input_data_model=input_data_model,
            input_schema=input_schema,
            input_template=input_template,
            output_data_model=output_data_model,
            output_schema=output_schema,
            output_template=output_template,
            batch_size=batch_size,
            limit=limit,
            repeat=repeat,
        )
        self.path = path
        self.name = name
        self.split = split
        self.revision = revision
        self.streaming = streaming
        self.load_kwargs = kwargs

        self._dataset = load_dataset(
            path,
            name=name,
            split=split,
            revision=revision,
            streaming=streaming,
            **kwargs,
        )

    def _iter_rows(self):
        if hasattr(self._dataset, "keys") and not self.split:
            for split_name in self._dataset.keys():
                yield from self._dataset[split_name]
        else:
            yield from self._dataset

    def __len__(self):
        if self.streaming and self.limit is None:
            raise NotImplementedError("Streaming HF datasets have unknown length.")
        if self.limit is not None:
            num_rows = self.limit
        elif hasattr(self._dataset, "keys") and not self.split:
            num_rows = sum(len(self._dataset[s]) for s in self._dataset.keys())
        else:
            num_rows = len(self._dataset)
        return self._total_batches(num_rows)
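
The split-order iteration in `_iter_rows` can be tried without downloading anything. The sketch below copies that branching logic into a standalone function, using a plain `dict` of lists as a stand-in for a `DatasetDict` and a plain list as a stand-in for a single split; the row contents are invented for illustration.

```python
def iter_rows(dataset, split=None):
    """Yield rows from one split, or from every split of a dict-like."""
    if hasattr(dataset, "keys") and not split:
        # No split requested: walk the splits in their stored order.
        for split_name in dataset.keys():
            yield from dataset[split_name]
    else:
        # A single split was loaded, so the object is iterated directly.
        yield from dataset


all_splits = {"train": [{"id": 1}, {"id": 2}], "test": [{"id": 3}]}
one_split = [{"id": 3}]

ids_all = [row["id"] for row in iter_rows(all_splits)]
ids_one = [row["id"] for row in iter_rows(one_split, split="test")]
```

Rows from `train` come before rows from `test` only because the stand-in dict stores them in that order.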

`load_split(path, *, name=None, split, input_data_model, input_template, output_data_model=None, output_template=None, limit=None, **load_kwargs)`

Materialize a single HF split into one (x, y) pair, or a one-element (x,) tuple.

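A thin convenience wrapper around HuggingFaceDataset(streaming=False).materialize() that accepts a subset of the HuggingFaceDataset constructor arguments and returns NumPy object arrays directly. It sets streaming and batch_size itself, so those cannot be overridden; any extra keyword arguments are passed through to the constructor.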

Use this when you want a whole HF split as in-memory NumPy arrays: for evaluation, for head/tail train/test splits via split_train_test, or for quick experiments. For streaming use cases, construct HuggingFaceDataset directly.
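
As an illustration of what a head/tail split over materialized arrays means, here is a plain-Python sketch. `head_tail_split` and its `train_fraction` parameter are hypothetical and are not synalinks' `split_train_test`, whose signature is not shown here; lists stand in for the NumPy object arrays, and the same slicing works on arrays.

```python
def head_tail_split(x, y, train_fraction=0.8):
    """Split aligned sequences into a head (train) and a tail (test) part."""
    cut = int(len(x) * train_fraction)  # index where the tail begins
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])


# Stand-ins for the (x, y) pair that a materialized split would provide.
x = [f"q{i}" for i in range(10)]
y = [f"a{i}" for i in range(10)]

(x_train, y_train), (x_test, y_test) = head_tail_split(x, y, train_fraction=0.8)
```

No shuffling happens here, so the test portion is simply the last rows of the split in their original order.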

Source code in synalinks/src/datasets/huggingface_dataset.py

@synalinks_export(["synalinks.datasets.load_split"])
def load_split(
    path,
    *,
    name=None,
    split,
    input_data_model,
    input_template,
    output_data_model=None,
    output_template=None,
    limit=None,
    **load_kwargs,
):
    """Materialize a single HF split into one ``(x, y)`` (or ``(x,)``) pair.

    A thin convenience wrapper around
    ``HuggingFaceDataset(streaming=False).materialize()`` that takes
    the same arguments as the ``HuggingFaceDataset`` constructor and
    returns numpy object arrays directly.

    Use this when you want a whole HF split as in-memory NumPy
    arrays — for evaluation, head/tail train/test splits via
    ``split_train_test``, or quick experiments. For streaming use
    cases, construct ``HuggingFaceDataset`` directly.
    """
    ds = HuggingFaceDataset(
        path=path,
        name=name,
        split=split,
        streaming=False,
        input_data_model=input_data_model,
        input_template=input_template,
        output_data_model=output_data_model,
        output_template=output_template,
        batch_size=None,
        limit=limit,
        **load_kwargs,
    )
    return ds.materialize()